
Node metrics #948

Status: Open. Wants to merge 35 commits into base: master.
Conversation

@cody-littley (Contributor) commented Dec 3, 2024

Why are these changes needed?

Adds metrics to the v2 DA node.

Checks

  • I've made sure the tests are passing. Note that there might be a few flaky tests; in that case, please comment that they are not relevant.
  • I've checked the new test coverage and the coverage percentage didn't drop.
  • Testing Strategy
    • Unit tests
    • Integration tests
    • This PR is not tested :(

@cody-littley cody-littley self-assigned this Dec 3, 2024
@cody-littley cody-littley marked this pull request as ready for review December 6, 2024 16:48
@cody-littley cody-littley marked this pull request as draft December 6, 2024 17:50
@cody-littley cody-littley marked this pull request as ready for review December 6, 2024 18:24
node/config.go (resolved)
@@ -101,6 +110,8 @@ func (s *ServerV2) StoreChunks(ctx context.Context, in *pb.StoreChunksRequest) (
        return
    }

    s.metrics.ReportStoreChunksDataSize(size)

Contributor:
what if the store operation gets reverted in L125?

cody-littley (Contributor, Author):
As a general rule of thumb, should we report incremental metrics if the operation as a whole fails? Or should we only report metrics for an operation if it is successful? (in another PR, you suggested that I should report latencies even when there are failures).

I can make this only report if the request ends up being valid, but I want to be consistent with the way we handle scenarios like this.
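
To make the two options concrete, here is a minimal, self-contained sketch; fakeMetrics and the helper functions are hypothetical, and only ReportStoreChunksDataSize mirrors the metric method used in this PR:

package example

// Hypothetical sketch of the two reporting strategies being weighed here.
type fakeMetrics struct {
    reportedSizes []int64
}

func (m *fakeMetrics) ReportStoreChunksDataSize(size int64) {
    m.reportedSizes = append(m.reportedSizes, size)
}

// reportEagerly records the size before validation, so requests that are
// later rejected still contribute to the metric.
func reportEagerly(m *fakeMetrics, size int64, validateAndStore func() error) error {
    m.ReportStoreChunksDataSize(size)
    return validateAndStore()
}

// reportOnSuccess records the size only if the operation as a whole succeeds.
func reportOnSuccess(m *fakeMetrics, size int64, validateAndStore func() error) error {
    if err := validateAndStore(); err != nil {
        return err
    }
    m.ReportStoreChunksDataSize(size)
    return nil
}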

cody-littley (Contributor, Author):
As discussed offline, this will be left the way it currently is.

node/grpc/server_v2.go (outdated, resolved)

for m.isAlive.Load() {
    var size int64
    err := filepath.Walk(m.dbDir, func(_ string, info os.FileInfo, err error) error {

Contributor:
is this thread safe?

cody-littley (Contributor, Author) commented Dec 10, 2024:

It is almost certainly not thread safe (i.e. if levelDB deletes a file or a directory mid-walk, then filepath.Walk() will return an error). My hope was that if the race condition was sufficiently rare, we could still extract meaningful metrics data.

Currently, this will log an error whenever this method is unable to fetch new data. Will an error log cause problems if it triggers every once in a while? If so, should this be downgraded to a logger.info() call?

Unfortunately, levelDB doesn't expose an API that tells you the size of the DB (that I know of). My reasoning was that this metric would be sufficiently valuable to justify a hacky collection method.

In theory, we could have the levelDB wrapper track the quantity of data, at the cost of some extra bookkeeping (every DB modification would need to update a special size key-value pair); a rough sketch of this idea is shown below. This wouldn't tell us the size of the files on disk (which may vary depending on things like compaction and indexes), but would give us a very good idea of the approximate size of the DB. If I implemented such a thing, it would need to be in a standalone PR.

The final option would be to just delete this metric entirely. I'll defer to your judgement on this.
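
A rough, self-contained sketch of that bookkeeping idea; all names here are hypothetical, and a real version would wrap the node's levelDB handle rather than an in-memory map:

package example

import "sync"

// sizeTrackingStore keeps a running estimate of stored bytes so a metric
// can read it cheaply instead of walking the database directory on disk.
type sizeTrackingStore struct {
    mu        sync.Mutex
    data      map[string][]byte // stand-in for the underlying levelDB
    totalSize int64             // running estimate of stored key+value bytes
}

func newSizeTrackingStore() *sizeTrackingStore {
    return &sizeTrackingStore{data: make(map[string][]byte)}
}

func (s *sizeTrackingStore) Put(key string, value []byte) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if old, ok := s.data[key]; ok {
        s.totalSize -= int64(len(key) + len(old))
    }
    s.data[key] = value
    s.totalSize += int64(len(key) + len(value))
}

func (s *sizeTrackingStore) Delete(key string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    if old, ok := s.data[key]; ok {
        s.totalSize -= int64(len(key) + len(old))
        delete(s.data, key)
    }
}

// ApproximateSize is what the metric would report.
func (s *sizeTrackingStore) ApproximateSize() int64 {
    s.mu.Lock()
    defer s.mu.Unlock()
    return s.totalSize
}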

cody-littley (Contributor, Author):

This metric is now removed.

node/grpc/v2_metrics.go (outdated, resolved)
node/grpc/server_v2.go (outdated, resolved)
node/flags/flags.go (outdated, resolved)
node/grpc/server_v2.go (outdated, resolved)
node/grpc/v2_metrics.go (outdated, resolved)

func (m *V2Metrics) ReportStoreChunksLatency(latency time.Duration) {
    m.storeChunksLatency.WithLabelValues().Observe(
        float64(latency.Nanoseconds()) / float64(time.Millisecond))
}

Contributor:
float64(latency.Milliseconds()) should work

cody-littley (Contributor, Author) commented Dec 12, 2024:

I'm intentionally not using that (@ian-shim made the same suggestion on another PR 😜). latency.Milliseconds() returns an int64, meaning we lose all sub-millisecond fidelity in the measurement. For many metrics this isn't a huge deal (e.g. if you are measuring something that takes hundreds of milliseconds), but for some of the things we are measuring, the precision is nice to have.

Let me know if you'd like to discuss this further.

Contributor:

I think it depends on the expected numbers we'll be dealing with here. If it's ultra-low latency and sub-ms precision matters, Microseconds() may be an option.

cody-littley (Contributor, Author):

For the sake of consistency, I've been converting everything to ms for reporting latencies.

Although I can often make an educated guess as to the expected latency for an operation, absent experimental data it's only a guess. If I guess wrong, we could end up with a unit that doesn't give us the resolution we actually need.

Would it be worthwhile to schedule a short call to discuss this?

cody-littley (Contributor, Author):

As discussed, all of these now use a utility method in common.
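
For reference, a minimal sketch of what such a shared helper could look like; the package and function name are illustrative, not necessarily the actual utility in common:

package common

import "time"

// ToMilliseconds converts a duration to fractional milliseconds so that
// sub-millisecond precision is preserved when observing a histogram.
func ToMilliseconds(d time.Duration) float64 {
    return float64(d.Nanoseconds()) / float64(time.Millisecond)
}

A call site would then read, e.g., m.storeChunksLatency.WithLabelValues().Observe(common.ToMilliseconds(latency)).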

node/store_v2.go (resolved)